Attention Is All You Need

Paper available at: https://arxiv.org/abs/1706.03762

Recap:

  • Traditionally, NLP tasks were approached using RNNs
  • Information flows through hidden states, which must retain everything seen so far
  • The decoder produces the next word from the hidden state passed along the chain
  • Inefficient because of the number of sequential steps and computations (see the sketch after this list)
  • RNNs have a hard time learning long-term dependencies
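
A minimal NumPy sketch of the sequential bottleneck described above (toy sizes, random weights, and a vanilla tanh RNN cell are assumptions, not the paper's setup):

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_h, seq_len = 8, 16, 10

    W_xh = rng.normal(size=(d_in, d_h)) * 0.1  # input-to-hidden weights
    W_hh = rng.normal(size=(d_h, d_h)) * 0.1   # hidden-to-hidden weights

    x = rng.normal(size=(seq_len, d_in))       # one toy input sequence
    h = np.zeros(d_h)                          # hidden state starts empty

    for t in range(seq_len):                   # strictly sequential: step t needs step t-1
        h = np.tanh(x[t] @ W_xh + h @ W_hh)    # everything seen so far is squeezed into h

    print(h.shape)                             # (16,) -- the whole sequence compressed into one vector

Every long-range dependency has to survive this repeated squeezing, which is why RNNs struggle with them.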

Attention:

  • The decoder can decide to attend to hidden states from earlier steps of the encoding
  • There's no need to push information through the whole chain of hidden states
  • Outputs Keys (K) and Queries (Q)
  • Keys index the hidden states via a softmax (see the sketch after this list)
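
A minimal sketch of this soft indexing (scaled dot-product attention; the random hidden states and sizes are made-up stand-ins):

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        scores = Q @ K.T / np.sqrt(K.shape[-1])  # scaled dot products of queries and keys
        weights = softmax(scores, axis=-1)       # soft "index" over the keys
        return weights @ V, weights              # weighted sum of the values

    rng = np.random.default_rng(0)
    d_k, src_len = 16, 7
    K = rng.normal(size=(src_len, d_k))   # keys: one per encoder hidden state
    V = rng.normal(size=(src_len, d_k))   # values: the hidden states themselves
    Q = rng.normal(size=(1, d_k))         # one decoder query

    out, w = attention(Q, K, V)
    print(out.shape, w.sum())             # (1, 16), weights sum to 1.0

The softmax weights act as a differentiable index: the query pulls out a weighted mix of the encoder's hidden states instead of a single one.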

Attention is a major paradigm shift:

  • The source sentence is fed into the inputs (encoder side)
  • The target sentence is fed into the outputs (decoder side, shifted right)
  • The output probabilities predict the next word
  • In contrast to an RNN, the output for a single token is produced in one shot as one sample; there's no multi-step backprop through time (see the sketch after this list)
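
A minimal sketch of why one forward pass covers the whole target sentence (toy self-attention with a causal mask; the embeddings and sizes are assumptions, and the learned projections are omitted):

    import numpy as np

    tgt_len, d = 5, 16
    rng = np.random.default_rng(0)
    X = rng.normal(size=(tgt_len, d))                 # embedded target tokens (shifted right)

    mask = np.triu(np.ones((tgt_len, tgt_len)), k=1)  # 1 above the diagonal = "future"
    scores = X @ X.T / np.sqrt(d)
    scores = np.where(mask == 1, -1e9, scores)        # block attention to future tokens

    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    out = weights @ X                                 # all positions computed at once
    print(out.shape)                                  # (5, 16): one pass, one loss, one backprop step

All target positions are scored in parallel, so a single loss and a single backprop step cover the whole sentence.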

Multi-Head Attention:

  • Self-attention: use attention over the input sequence (sentence) itself, with several heads in parallel (see the sketch below)
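
A minimal NumPy sketch of multi-head self-attention over one sentence (random embeddings and projections are assumptions; the paper's final output projection is omitted for brevity):

    import numpy as np

    def softmax(z):
        z = z - z.max(-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(-1, keepdims=True)

    rng = np.random.default_rng(0)
    seq_len, d_model, n_heads = 6, 32, 4
    d_head = d_model // n_heads

    X = rng.normal(size=(seq_len, d_model))           # token embeddings of one sentence
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))

    def heads(M):                                     # split the model dim into independent heads
        return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = heads(X @ W_q), heads(X @ W_k), heads(X @ W_v)   # (heads, seq, d_head)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    out = softmax(scores) @ V                                  # each head attends separately
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)     # concatenate the heads back
    print(out.shape)                                           # (6, 32)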

Combining the source sequence with the target sequence via multi-head attention (encoder-decoder attention):

  • The encoder of the source sentence discovers interesting things & builds Key-Value pairs
  • The encoder of the target sentence builds the Queries
  • The Values of the source sentence are indexed using Keys
  • The Query part asks for what the network would like to read from the source (see the sketch after this list)
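
A minimal sketch of this encoder-decoder ("cross") attention, with random matrices standing in for the real encoder/decoder activations (an assumption for illustration only):

    import numpy as np

    def softmax(z):
        z = z - z.max(-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(-1, keepdims=True)

    rng = np.random.default_rng(0)
    src_len, tgt_len, d = 8, 5, 16

    enc = rng.normal(size=(src_len, d))   # encoder states of the source sentence
    dec = rng.normal(size=(tgt_len, d))   # decoder states of the target sentence

    W_k, W_v, W_q = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

    K, V = enc @ W_k, enc @ W_v           # Key-Value pairs built from the source
    Q = dec @ W_q                         # Queries built from the target

    context = softmax(Q @ K.T / np.sqrt(d)) @ V   # each target position reads the source
    print(context.shape)                          # (5, 16)

Keys and Values come from the source side, Queries from the target side; each target position reads a weighted mix of the source sentence.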

Advantages:

  • Reduction in the number of sequential computation steps
  • Shorter paths between distant positions, which makes long-range dependencies easier to learn
  • Implications for other machine learning fields (computer vision etc.)

Look at: